Introduction
SQL stands for Structured Query Language. It is a programming language designed for managing data in a relational database. SQL SQL has a variety of functions that allow its users to read, manipulate, and change data. In this notebook, present basic to advanced level SQL programing for data analysis, for example filtering, aggregating, grouping, joining tables, subqueries, window functions and so forth.
Table of Contents
- 1 What is SQL
- 2 Select Records
- 3 WHERE CLAUSE, LIKE, IN and BETWEEN
- 4 Join Multiple Tables
- 5 Date and Time Data Types
- 6 Aliasing in SQL
- 7 Joining Multiple Columns and String Values
- 8 Aggregating Data
- 9 Extracting Characters from Strings
- 10 String Length and Character Position
- 11 Creating a Database
- 12 Adding data to a Table
- 13 Insert and Update Statement
- 14 Delete, Drop and Alter Table
- 15 IS NULL and IS NOT NULL
- 16 ANY and ALL
- 17 Over Clause
- 18 Where 1=1
- 19 CONCAT_WS
- 20 Views
- 21 Two Popular SQL Flavors
- 22 ROUND()
- 23 TRUNC() Function
- 24 CASE Expression
- 25 Subqueries
- 26 Window Functions
- 27 COALESCE() Function
- 28 CAST() Function
- 29 Casting with ::
What is SQL¶
SQL ( Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system. SQL is used for:
- create databases
- delete databases
- create tables in a database
- update data in a table
- delete data from a table
- read data from a table
- insert data in a table
- delete database tables
- and many more database operations
All data such as Employee data, customer data, 3rd Party data will be stored in database or databases; then companies retrieve data by sql query to analysis and make business decision.
We should connect to server, which is a physical Machine, to access the data. Under each server, there are databases that all data are structured in tables. There are Schema Under each database. Schema is ownership structure. Within each database, there may be different owners (Schema). For example, sales team have their own schema. Under each Schema, there are objects like tables. For example, Person.salary table means it is owned by Person Schema. See Figure below:
Databases are normalized before they use. Normalization is the process of organizing data in database and creating tables and establishing relationships between those tables to make the database more flexible. Figure below shows how normalization works. All tables can be joined with ID table:
Select Records¶
All the tables below are Persona schema (owner):
Top recodes for large data can be achieved by:
WHERE CLAUSE, LIKE, IN and BETWEEN¶
The WHERE clause, LIKE, IN are used to filter records.
It is used to extract only those records that fulfill a specified condition.
LIKE operator finds a pattern and is the same as = for string. Not LIKE is the same as !=.
The queries below finds any records with Name string starts with ‘A’, end ‘D’ and have "-". The % wildcard will match zero, one, or many characters in text.
The IN operator allows to specify multiple values in a WHERE clause. The IN operator is a shorthand for multiple OR conditions.
The BETWEEN operator selects values within a given range. The values can be numbers, text, or dates.
_ match a single character:
Join Multiple Tables¶
A JOIN clause is used to combine rows from two or more tables, based on a related column between them that have the same key. Here are the different types of the JOINs in SQL:
(INNER) JOIN¶
INNER JOIN is the same as JOIN: it Returns records that have matching values in both tables. See Figure below:
We use (INNER) JOIN most of the times.
LEFT (OUTER) JOIN¶
LEFT JOIN is the same as LEFT (OUTER) JOIN: it returns all records from the left table, and the matched records from the right table. See Figure below:
RIGHT (OUTER) JOIN¶
RIGHT JOIN is the same as RIGHT (OUTER) JOIN: it returns all records from the right table, and the matched records from the left table. See Figure below:
FULL (OUTER) JOIN¶
FULL JOIN is the same as FULL (OUTER) JOIN: it returns all records when there is a match in either left or right table. See Figure below:
ANTI LEFT JOIN¶
ANTI RIGHT JOIN¶
ANTI OUTER JOIN¶
CROSS JOIN¶
UNION¶
The UNION operator combines the result-set of two or more SELECT statements. There are conditions for this operator:
- Every SELECT statement within UNION must have the same number of columns
- The columns must also have similar data types
- The columns in every SELECT statement must also be in the same order
The UNION operator selects only distinct values by default. We can use UNION ALL to allow duplicate values.
Date and Time Data Types¶
| Data type | Description |
|---|---|
| DATE | A date. Format: YYYY-MM-DD. The supported range is from '1000-01-01' to '9999-12-31' |
| DATETIME(fsp) | A date and time combination. Format: YYYY-MM-DD hh:mm:ss. The supported range is from '1000-01-01 00:00:00' to '9999-12-31 23:59:59'. Adding DEFAULT and ON UPDATE in the column definition to get automatic initialization and updating to the current date and time |
| TIMESTAMP(fsp) | A timestamp. TIMESTAMP values are stored as the number of seconds since the Unix epoch ('1970-01-01 00:00:00' UTC). Format: YYYY-MM-DD hh:mm:ss. The supported range is from '1970-01-01 00:00:01' UTC to '2038-01-09 03:14:07' UTC. Automatic initialization and updating to the current date and time can be specified using DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP in the column definition |
| TIME(fsp) | A time. Format: hh:mm:ss. The supported range is from '-838:59:59' to '838:59:59' |
| YEAR | A year in four-digit format. Values allowed in four-digit format: 1901 to 2155, and 0000.MySQL 8.0 does not support year in two-digit format. |
Aliasing in SQL¶
Aliasing can be applied for both table name and column name. The reason to use aliasing is because of:
- Less code
- Readability
- Better Column Names
Figure below shows a query with aliasing for tables:
Figure below shows a query with aliasing for both tables and column names:
Joining Multiple Columns and String Values¶
We can merge multiple columns by + sign: Column1+Column2.... see examples below:
Merging columns and adding a character between two columns:
Apply aliasing after merging columns:
We can use Concat for newer version instead of +.
Aggregating Data¶
GROUP BY: groups rows that have the same values into summary rows
GROUP BYis often used with aggregate functions (COUNT(*)(number of records) ,MAX()(Maximum),MIN()(Minimum),SUM(),AVG()(Average) to group the result-set by one or more columns.
- Null values are excluded from aggregation.
ORDER BY: sort the result-set in ascending (DESC) or descending (ASC) order.
See example below
GROUP BYandCOUNT(*)
GROUP BY,ORDER BYandCOUNT(*)
GROUP BY,ORDER BY,MIN()andMAX()
GROUP BY,ORDER BY,MIN(),MAX()andCOUNT(*)
GROUP BYandSUM()
GROUP BY,SUM()andAVG
WHERE clause does not get aggregated values. Instead, we should use HAVING Clause.
Extracting Characters from Strings¶
We can use Left, Right and Substringto extract characters of records. See examples below:
- Using
LEFT
- Using
SUBSTRINGto Extract 3 characters from a string, starting in position 2:
String Length and Character Position¶
LENis used to calculate length of string record:
CHARINDEX is used to find position of a character in string:
Creating a Database¶
We can use CREATE TABLE to design a table manually or programmatically:
We should always select a column IDENTITY as primary key. It is for connecting tables which is an ID number that does not allow NULL.
Adding data to a Table¶
Right click on database and click on “Edit Top 200 Rows” and then add data:
Insert and Update Statement¶
For automatically insert in your database:
INSERT INTO table_name (column1, column2, column3...)
VALUES (value1,value2,value3...)
UPDATE can be used to add records:
Delete, Drop and Alter Table¶
DELETE * FROM Current_Customers, will delete all the records from the table but the design will not disappear. Drop statement totally removes a table. See examples below:
The ALTER TABLE statement is used to add, delete, or modify columns in an existing table.
The ALTER TABLE statement is also used to add and drop various constraints on an existing table.
IS NULL and IS NOT NULL¶
ANY and ALL¶
The ANY and ALL operators allow to perform a comparison between a single column value and a range of other values.
The ANY operator:
- returns a boolean value as a result
- returns TRUE if ANY of the subquery values meet the condition
ANY means that the condition will be true if the operation is true for any of the values in the range.
Table below is Products
Table below is OrderDetails
query below show Any:
| Condition | Meaning |
|---|---|
| x = ANY (…) | The values in column c must match one or more values in the set to evaluate to true. |
| x != ANY (…) | The values in column c must not match one or more values in the set to evaluate to true. |
| x > ANY (…) | The values in column c must be greater than the smallest value in the set to evaluate to true. |
| x < ANY (…) | The values in column c must be smaller than the biggest value in the set to evaluate to true. |
| x >= ANY (…) | The values in column c must be greater than or equal to the smallest value in the set to evaluate to true. |
| x <= ANY (…) | The values in column c must be smaller than or equal to the biggest value in the set to evaluate to true. |
The ALL operator:
- returns a boolean value as a result
- returns TRUE if ALL of the subquery values meet the condition
- is used with SELECT, WHERE and HAVING statements
ALL means that the condition will be true only if the operation is true for all values in the range.
| Condition | Meaning |
|---|---|
| c > ALL(…) | The values in column c must greater than the biggest value in the set to evaluate to true. |
| c >= ALL(…) | The values in column c must greater than or equal to the biggest value in the set to evaluate to true. |
| c < ALL(…) | The values in column c must be less than the lowest value in the set to evaluate to true. |
| c >= ALL(…) | The values in column c must be less than or equal to the lowest value in the set to evaluate to true. |
| c <> ALL(…) | The values in column c must not be equal to any value in the set to evaluate to true. |
| c = ALL(…) | The values in column c must be equal to any value in the set to evaluate to true. |
Over Clause¶
The OVER clause in SQL Server is used with PARTITION BY to break up the data into partitions. Here is an example:
Over clause can be used with with Lead function which give
Where 1=1¶
The statement where 1=1 in SQL means true. It is the same operation as running the select statement without the where clause.
In most cases, you will only need to use this clause when you need to build dynamic SQL statements. Once you apply the where 1=1 clause, all the subsequent statements after it can start with the ‘and’ keyword.
Consider an example where you are guessing an id of a column. Assuming you are not sure whether that id exists on the database, you can use something like where 1=1 to return all the rows even if the target id is not on the database.
The query above uses an or statement. Hence, only one of the conditions needs to be true for the query to work.
Even if there is no user with an id of 10, the 1=1 will always evaluate to true, and the query will fetch all the rows in the specified table.
CONCAT_WS¶
The CONCAT_WS() function adds two or more strings together with a separator.
CONCAT_WS(separator, string1, string2, ...., string_n)
Add strings together. Use - to separate the concatenated string values:
Here is another example:
Views¶
- A view is a virtual table that is the result of a saved SQL
SELECTstatement - When accessed, views automatically update in response to updates in the underlying data
CREATE VIEW employee_years AS
SELECT id, name, year
FROM employees;
Then we can make a query to get the result from view:
SELECT id, name
FROM employee_years;
Two Popular SQL Flavors¶
PostgreSQL
- Created at the University of California, Berkeley
- Free and open-source relational database system
- "PostgreSQL" refers to both the PostgreSQL database system
SQL Server
- Created by Microsoft
- Has free and paid versions
- Queries using T-SQL
The difference between different types of SQL is similar to American and British English. There are key function in common. For example PostgreSQL uses LIMIT while SQL uses TOP at the end of query.
ROUND()¶
Round a number to a specific decimal
Image retrieved from https://www.datacamp.com/
We can also round it to whole number:
Image retrieved from https://www.datacamp.com/
Negative value will round the digits from left:
Image retrieved from https://www.datacamp.com/
TRUNC() Function¶
Return a number truncated to decimal places.
TRUNC(number, decimals)
if decimals>0 it returns number of digits after decimals:
SELECT TRUNC(15.47895, 2)= 15.47
if decimals<0 it returns number of digits before decimals:
SELECT TRUNC(12345, -3)= 12000
CASE Expression¶
The CASE expression goes through conditions and returns a value when the first condition is met (like an if-then-else statement). So, once a condition is true, it will stop reading and return the result. If no conditions are true, it returns the value in the ELSE clause.
If there is no ELSE part and no conditions are true, it returns NULL.
See example below:
CASE statement can be located inside WHERE cluse.It can be ended to END IS NOT NULL to not have Null values:
CASE WHEN with Aggregate¶
- CASE statements are great for
- Categorizing data
- Filtering data
- Aggregating data
Image retrieved from https://www.datacamp.com/
- CASE WHEN with
COUNT
Image retrieved from https://www.datacamp.com/
- CASE WHEN with
SUM
Image retrieved from https://www.datacamp.com/
Image retrieved from https://www.datacamp.com/
Subqueries¶
A subquery is a query nested inside another query:
Subqueries can be placed in any part of a query:
SELECTFROMWHEREGROUP BY
It can return scaler, a list or a table.
A subquery is nested inside another query that can be run on its own.
The example below shows a subquery inside a where clause:
The query
can be run on its own.
SQL first process inside subquery as a single value table or list then the outer query is processed
Subquery in FROM¶
Subquery in where clause returns a single column, but what if we want to get more complex query. Subquery in from gives restructure and transform your data. For example:
- Transform Data
- Prefilter Data
- Calculate aggregates of aggregates
Usually we should make the subquery first and then add to main query.
Image retrieved from https://www.datacamp.com/
See the query below that used subquery to make a new table using AS then select its table in SELECT statement:
Below is a complex subquery in FROM within outer query:
Subquery in Select¶
- Subquery in Select should return a single value, otherwise it will generate an error
- Should properly filter both the main and the subquery. Since subquery is processed before the main query, we only need to include the relevant filter int the subquery as well as the main query.
. See query below:
Correlated subquery¶
- Correlated subquery uses values from outer query to generate a result
- Subquery is re-run for every row generated in the final data set
- It is usually used for advanced joining, filtering ...
| Simple Subquery | Correlated Subquery |
|---|---|
| Can be run independently from main query | Is dependent on the main query to execute |
| Evaluated only once | Evaluated in loops. This leads to significantly slow down query runtime. |
Figure below shows simple subquery on the left and correlated subquery on the right
Another complex correlated subquery:
Nested Subqueries¶
Subqueries nested inside other subqueries.
The EXTRACT() function extracts a part from a given date. For example EXTRACT(part FROM date): SELECT EXTRACT(WEEK FROM "2017-06-15");
Query below shows an example of simple subquery:
Another complex example of subquery:
WITH clause (Common Table Expressions)¶
The SQL WITH clause was introduced by Oracle. The SQL WITH clause allows a sub-query block a name, which can be referenced in several places within the main SQL query. It can be considered as a function.
The WITH clause is considered “temporary” because the result is not permanently stored anywhere in the database schema.
CTE is only run once and store in memory. This leads to improve the amount of time to run your query.
CTE leads to improve organization of queries for long query
See complex example below:
Only the first CTE is declared using WITH. Afterward, use the cte's name, away, and AS.
Here is another complex cte query
Window Functions¶
It is impossible to get every single none aggregate value without `GROUPING BY`: it is not possible to get aggregate values with none aggregate values. Query returns an error. To resolve this, we can use window function. The PARTITION BY clause allows to calculate separate "windows" based on columns that we want to divide results.
Over¶
- Calculation is performed on an already generated result set (a window)
- Aggregated calculations can be applied without grouping data as Running totals, rankings, moving averages
Both queries below give the same answer. The left query is subquery but right query is using window function
The query below shows what the rank of matches based on number of goals. Window functions allows to creates a RANK() of information according to any variable and also sort the data. When setting this up, you will need to specify what column/calculation you want to use to calculate your rank. This is done by including an ORDER BY clause inside the OVER() clause:
This option is available in PostgreSQL, Oracle, MySQL, SQL Server....but NOT SQLite.
The OVER() clause allow to pass an aggregate function down a data set, similar to subqueries in SELECT. However, the OVER() clause has significant benefits over subqueries in select: 1- queries will run faster, 2- the OVER() clause has a wide range of additional functions and clauses.
OVER with a PARTITION¶
- Partition allows you to calculate separate values for each categories.
- Using only one column of data for different calculation
- It is simply calculated by
PARTITION BYwithinOVER:
AVG(ABC) OVER(PARTITION BY DEF)
First use an aggregate function to calculate aggregate values then such as average of ABC in above, then add an OVER clause afterward and inside the parentheses, say PARTITION BY followed by the column you want to average by (DEF)
Query below shows how many goals scored in each match, and how to compare with the season's average??
We can use partition by with multiple columns just having multiple columns after PARTITION BY. See query below:
Sliding Windows¶
- Sliding functions perform calculation relative to the current row of data set
- Can be used to calculate one row of data set at a time.
- Can be partitioned by one or more columns
The code is written as
ROWS BETWEEN <start> AND <finish>
PRECEDINGandFOLLOWINGare the rows before and after the current row, respectivelyUNBOUNDED PRECEDINGANDUNBOUNDED FOLLOWINGshowing the row from beginning or the end of data set in your calculationCURRENT ROWtells sql you want to stop running at current row
The query below calculates sum of goals manchester city played as home team during '2011/2012' and then turn the calculation into running total order by day of the match and calculating from beginning of data set to the current row
The query below calculate the sum of goals between previous and current match.
COALESCE() Function¶
- Operates row by row
- Returns first non-NULL value
CAST() Function¶
One variable type can be converted to another type via cast() function. For example, the query below givs integer of 14.3 which is 14.
We can do this for entire column:
Casting with ::¶
Another way of casting is using :: notation